Benchmarks shape progress in AI research. A useful benchmark should be both difficult and realistic: questions should challenge frontier models while also reflecting real-world usage. Yet, current paradigms face a difficulty–realism tension: exam-style benchmarks are often made artificially difficult with limited real-world value, while benchmarks based on real user interaction often skew toward easy, high-frequency problems.
This work explores a radically different paradigm: assessing models on unsolved questions. Rather than a static benchmark scored once, we curate unsolved questions and evaluate models asynchronously over time with validator-assisted screening and community verification. We introduce UQ, a testbed of 500 challenging, diverse questions sourced from Stack Exchange, spanning topics from CS theory and math to less explored areas like sci-fi and history, probing capabilities including reasoning, factuality, and browsing.
UQ is difficult and realistic by construction: unsolved questions are often hard and arise naturally when humans seek answers, so solving them yields direct real-world value.
Rank | System | Organization | UQ-Validator Pass Rate | All Questions | Technology | Culture & Recreation | Life & Arts | Science |
---|---|---|---|---|---|---|---|---|
#1 | o3 Pro | OpenAI | 75 / 500 (15.0%) | 4 / 500* | 0 / 52 | 0 / 16 | 0 / 35 | 4 / 395 |
#2 | Gemini 2.5 Pro | Google | 25 / 500 (5.0%) | 3 / 500* | 0 / 52 | 0 / 16 | 0 / 35 | 3 / 395 |
#3 | o4 mini | OpenAI | 25 / 500 (5.0%) | 2 / 500* | 0 / 52 | 0 / 16 | 0 / 35 | 2 / 395 |
#4 | o3 | OpenAI | 44 / 500 (8.8%) | 1 / 500* | 1 / 52 | 0 / 16 | 0 / 35 | 0 / 395 |
#5 | DeepSeek R1 | DeepSeek | 11 / 500 (2.2%) | 1 / 500* | 0 / 52 | 0 / 16 | 0 / 35 | 1 / 395 |
#6 | GPT-5 | OpenAI | 88 / 500 (17.6%) | 0 / 500* | 0 / 52 | 0 / 16 | 0 / 35 | 0 / 395 |
#7 | Claude Opus 4 | Anthropic | 7 / 500 (1.4%) | 0 / 500* | 0 / 52 | 0 / 16 | 0 / 35 | 0 / 395 |
#8 | Claude 3.7 Sonnet | Anthropic | 6 / 500 (1.2%) | 0 / 500* | 0 / 52 | 0 / 16 | 0 / 35 | 0 / 395 |
Found an interesting Stack Exchange unsolved question? You can check if it's in our UQ dataset by modifying the URL:
Original Stack Exchange URL:
https://math.stackexchange.com/questions/358423
UQ Mirrored URL:
https://uq.stanford.edu/q/math.stackexchange.com/questions/358423
✅ If the question exists in UQ:
You'll be automatically redirected to the UQ question page with model answers and analysis.
📝 If the question is not found:
You can submit a request to have it considered for inclusion in our dataset.
Tip: Both short URLs (without title) and full URLs (with title) work the same way!
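The URL rewrite above can be sketched in a few lines of Python. This is only an illustration based on the example pair shown: the helper name `to_uq_url`, the check for a `/questions/` path, and the choice to drop any query string or fragment are assumptions, not documented behavior of the UQ site.

```python
from urllib.parse import urlsplit

# Prefix taken from the mirrored URL example above.
UQ_PREFIX = "https://uq.stanford.edu/q/"


def to_uq_url(stack_exchange_url: str) -> str:
    """Build a UQ mirror URL from a Stack Exchange question URL (hypothetical helper)."""
    parts = urlsplit(stack_exchange_url)
    # Assumption: question URLs always have a host and a /questions/ path.
    if not parts.netloc or not parts.path.startswith("/questions/"):
        raise ValueError(f"not a Stack Exchange question URL: {stack_exchange_url!r}")
    # Keep host and path; drop the scheme, query string, and fragment (assumption).
    return f"{UQ_PREFIX}{parts.netloc}{parts.path.rstrip('/')}"


print(to_uq_url("https://math.stackexchange.com/questions/358423"))
```

Because the path is passed through unchanged, a full URL that includes the title slug is rewritten the same way as a short one.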
The most popular questions from the UQ Project based on Stack Exchange votes
A proof of without prime ideals?
Is there a bijection of with itself such that the forward map is connected but the inverse is not?
Given a finite extension of the rationals,
I was curious about the sum of two consecutive primes and after proving that the sum for the odd primes always has at least 3 prime divisors, I came up with this question:
Find the least natural numb...
Say that the perimeter of a polyhedron is the sum of its edge lengths. What is the maximum volume of a polyhedron with a unit perimeter? A reasonable first guess would be the regular tetrahedron of si...
Suppose for an arbitrary group word
Does there exist a complete, finitely axiomatizable, first-order theory
News
- [08/2025] Released Unsolved Questions (UQ) Paper
For questions about the project:
{niefan, kzliu, niklasm}@stanford.edu
For technical issues:
niefan@stanford.edu
If you use UQ: Assessing Language Models on Unsolved Questions, please cite:
@misc{nie2025uqassessinglanguagemodels,
  title={UQ: Assessing Language Models on Unsolved Questions},
  author={Fan Nie and Ken Ziyu Liu and Zihao Wang and Rui Sun and Wei Liu and Weijia Shi and Huaxiu Yao and Linjun Zhang and Andrew Y. Ng and James Zou and Sanmi Koyejo and Yejin Choi and Percy Liang and Niklas Muennighoff},
  year={2025},
  eprint={2508.17580},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2508.17580}
}